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C/3 ] Abstract 

, A rich coreset is a subset of the data which contains nearly all the essential information. We give 

deterministic, low order polynomial-time algorithms to construct rich coresets for simple and multiple 
response linear regression, together with lower bounds indicating that there is not much room for 
improvement upon our results. 

>: 

^ ; 1 Introduction 

Linear regression is an important technique in data analysis (Seber and Lee, 1977). Research in the area 
£NJ ■ ranges from numerical techniques (A. Bjorck, 1996) to robustness of the prediction error to noise (e.g. 
using feature selection (Guyon and Elisseeff, 2003)). 

Is it possible to efficiently identify a small subset of the data that contains all the essential information 
of a learning problem? Such a subset is called a "rich" coreset. We show that the answer is yes, for linear 
regression. Such a rich coreset is analogous to the support vectors in support vector machines (Cristianini 
and Shawe- Taylor, 2000). Such rich coresets contain the meaningful or important points in the data and 
can be used to find good approximate solutions to the full problem by solving a (much) smaller problem. 
When the constraints are complex (e.g. non-convex constraints), solving a much smaller regression problem 
could be a significant saving (Gao, 2007). 

We present coreset constructions for constrained regression (both simple and multiple response), as well 
as lower bounds for the size of "rich" coresets. In addition to potential computational savings, a rich coreset 
identifies the important core of a machine learning problem and is of considerable interest in applications 
with huge data where incremental approaches are necessary (eg. chunking) and applications where data 
is distributed and bandwith is costly (hence communicating only the essential data is imperative). 

Our first contribution is a deterministic, polynomial-time algorithm for constructing a rich coreset for 
arbitrarily constrained linear regression. Let k be the "effective dimension" of the data and let e > be 
the desired accuracy parameter. Our algorithm constructs a rich coreset of size O (fe/e 2 ) , which achieves a 
(1 + e)-relative error performance guarantee. In other words, solving the regression problem on the coreset 
results in a solution which fits all the data with an error which is at most (1 + e) worse than the best 
possible fit to all the data. We extend our results to the setting of multiple response regression using more 



1 



sophisticated techniques. Underlying our proofs are two sparsification tools from linear algebra (Batson 
et al., 2009; Boutsidis et al., 2011), which may be of general interest to the machine learning community 

1.1 Problem Setup 

Assume the usual setting with n data points (zi, yi), . . . , (z n , y n ); Zj £ R d are feature vectors (which 
could have been obtained by applying a non-linear feature transform to raw data) and yi £ R are targets 
(responses). The linear regression problem asks to determine a vector x op t £ T> C R rf that minimizes 

n 
i=l 

over x£P, where Wi are positive weights. So, £(x opt ) < £(x), for all x £ 2?. The domain P represents 
the constraints on the solution, e.g., in non-negative least squares (NNLS) (Lawson and Hanson, 1974; 
Bellavia et al., 2006), T> = R^, the nonnegative orthant. Our results hold for arbitrary V. 

A coreset of size r < n is a subset of the data, (z^,^), . . . , (zj r ,yj r ). The coreset regression problem 
considers the squared error on the coreset with a, possibly different, set of weights Sj > 0, 

r 

^( x ) = X^'( z J - x -^i) 2 - 

Suppose that £ is minimized at x op i, so £(5c pt) < £( x )> for all x £ V. The coreset is rich if, for some 
set of weights Sj, x op t is nearly as good as x op t for the original regression problem on all the data, i.e., for 
some small e > 0, 

£ (x^) < £ (xopt) < (1 + e)£(x opt ). 

The algorithm which constructs the coreset should also provide the weights Sj. For the remainder of 
the paper, we switch to an equivalent matrix formulation of the problem. We give some linear algebra 
background in the Appendix. 

Matrix Formulation. Let A £ R nxd be the data matrix whose rows are the weighted data points 
tJWiL^ and b £ R n is the similarly weighted target vector, 6j = yjwlyi- The effective dimension of the 
data can be measured by the rank of A; let k = rank(A).Our results hold for arbitrary d, however, in 
most applications, n> d and rank(A) ~ d. We can rewrite the squared error as £(x) = ||Ax — t» 1 1 2 , so, 

x opt = argmin ||Ax — b|||. 

A coreset of size r < n is a subset C £ W xd of the rows of A and the corresponding elements b c £ R r 
of b. Let D £ R rxr be a positive diagonal matrix for the coreset regression (the weights Sj of the coreset 
regression will depend on D). The weighted squared error on the coreset is given by £(x) = ||D(Cx— t»e) 1 1 2 ? 
so the coreset regression seeks x op t defined by 

x opt = argmin ||D (Cx - b c ) \\\. 

x6X> 

We say that the coreset is (1 + e)-rich if the solution obtained by fitting the coreset data can fit all the 
data almost optimally. Formally, 

||Ax opt - h\\l < ||Ax opt - h\\ 2 2 < (1 + e)||Ax opi - b||^. 
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1.2 Our contributions 



Constrained Linear Regression (Section 2). Our main result for constrained simple regression is 
Theorem 1, which describes a deterministic polynomial time algorithm that constructs a (l + e)-rich coreset 
of size O (fc/e 2 ). Prior to our work, the best result achieving comparable relative error performance guar- 
antees is Theorem 1 of (Boutsidis and Drineas, 2009) for constrained regression, and the work of (Drineas 
et al. , 2006) for unconstrained regression. Both of these prior results construct coresets of size O (k log k/e 2 ) 
and they are randomized, so, with some probability, the fit on all the data can be arbitrarily bad (despite 
the coreset being a logarithmic factor larger). Our methods have comparable, low order polynomial run- 
ning times and provide deterministic guarantees. The results in (Drineas et al., 2006) and (Boutsidis and 
Drineas, 2009) are achieved using the matrix concentration results in (Rudelson and Vershynin, 2007). 
However, these concentration bounds break unless the coreset size is at least ft (fclog(fc) / e 2 ) . We give the 
first algorithms that break the k log k barrier. 

We extend our results to multiple response regression, where the target is a matrix B G R nxw with 
oj > 1. Each column of B is a seperate target (or response) that we wish to predict. We seek to minimize 
|| AX — B|| over all X G T> C R dxw , and some matrix norm || • || . Multiple response regression has numerous 
applications, but is perhaps most common in multivariate time series analysis; see for example (Hamilton, 
1994; Breiman and Friedman, 1997). To illustrate, consider prediction of time series data: let Z G M( n + 1 ) xci 
be a set of d time series, where each column is a time series with n + 1 time steps; we wish to predict time 
step t + 1 from time step t. Let A contain the first n rows of Z and let B contain the last n rows. Then, 
we seek X that minimizes ||AX — B||, which is exactly the multiple response regression problem. In our 
work, we focus on the spectral norm || • ||2 and the Frobenius norm || • ||p, the two most common norms in 
matrix analysis. 

Mult i- Objective Regression (Section 3.1). An important variant of multiple regression is the so- 
called multi-objective regression. Let B = [bi, . . . , h u ] G R nXcJ , where we explicitly identify each column 
in B as a target response bj. We seek to simultaneously fit multiple target vectors with the same x, 
i.e. to simultaneously minimize ||Ax — bj 1 1 2 where j G {1, 2, . . . , co}. This is common when the goal is to 
trade off different quality criteria simultaneously. Writing X = [x, x, . . . ,x] G M. dXul (u> copies of x), we 
consider minimizing || AX — B|| F , which is equivalent to multiple regression with a strong constraint on 
X. We present results for coreset constructions for the Frobenius-norm multi-objective regression problem 
in Theorem 4, which describes a deterministic algorithm to construct (l + e)-rich coresets of size 0{k/e 2 ). 
Theorem 4 emerges by applying Theorem 1 after converting the Frobenius-norm multi-objective regression 
problem to a simple response regression problem. 

Arbitrarily-Constrained Multiple- Response Regression (Section 3.2). Using the same approach, 
converting the problem to a single response regression, we construct a (1 + e)-rich coreset for Frobenius- 
norm arbitrarily-constrained regression in Section 3.2. The coreset size here is O (fcw/e 2 ). 

Unconstrained Multiple- Response Regression (Section 4). In Section 4, we consider rich coresets 
for unconstrained multiple regression for both the spectral and Frobenius norms. The sizes of the coresets 
are smaller than the constrained case, and our main results are presented in Theorems 6 and 7. Theorem 6 
presents a (2 + e)-rich coreset of size 0((k + uj)/e 2 ) for spectral norm regression, while Theorem 7 presents 
a (2 + e)-rich coreset of size 0(k/e 2 ) for Frobenius norm regression. 

Lower Bounds (Section 5). Finally, in Section 5, we present lower bounds on coreset sizes. In the 
single response regression setting, we note that our algorithms need to look at the target vector b. We 
show that this is unavoidable, by arguing that no b-agnostic deterministic coreset construction algorithm 
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can construct rich coresets which are small (Theorem 11). We also present similar results for b-agnostic 
randomized coreset constructions (Theorem 12). Having shown that we cannot (in general) be b-agnostic, 
we present lower bounds on the size of rich coresets for spectral and Frobenius norm multiple response 
regression that apply in the non b-agnostic setting (Theorems 13 and 14). 



2 Constrained Linear Regression 

We define constrained linear regression as follows: given A G W lXd of rank k, b G W 1 , and T> C M. d , we 
seek x op j G T> for which || Ax^j — t* 1 1 2 — II Ax — b|| for all x G T> (the domain T> represents the constraints 
on x and can be arbitrary). To construct a coreset C G W xd (i.e., C consists of a few rows of A) and 
b c G M. r (i.e., b c consists of a few elements of b), we introduce sampling and rescaling matrices S and D 
respectively. More specifically, we define the row-sampling matrix S G W xn whose rows are basis vectors 
ej , . . . ,ej . Our coreset C is now equal to SA; clearly, C is a matrix whose rows are the rows of A 
corresponding to indices ii,...,i r . Similarly, b c = Sb contains the corresponding elements of the target 
vector. Next, let D G W xr be a positive diagonal rescaling matrix and define the D-weighted regression 
problem on the coreset as follows: 

x opt = argmin ||D (Cx — b c ) ||!j = argmin ||DS (Ax — b) \\ 2 . (1) 

xGD xGX> 

In the above, the operator DS first samples and then rescales rows of A and b. Theorem 1 is the main 
result in this section and presents a deterministic algorithm to select a rich coreset by constructing the 
matrices D and S. (All algorithms are given in the Appendix.) 

Theorem 1. Given A G W ixd of rank k, b G M n , and V C M. d , Algorithm 1 constructs matrices S G W xn 
and D G W xr (for any r > k + 1) such that i op t of eqn. (1) satisfies 



|Ax opt - b||| < r + k + l + 2y/r{k + l) =1 ^ 




II Axpp* — b||| r + k + l-2y/r(k + l) 

The running time of the proposed algorithm is T (U[A,b]) + O (rn/c 2 ), where T (U[A,b]) * s the time needed 
to compute the left singular vectors of the matrix [A, b] G R nx ( d+1 ). 

For any < e < 1, we can set r = k/e 2 to get an approximation ratio roughly equal to 1 + 4e. This 
result considerably improves the result in (Boutsidis and Drineas, 2009), which needs r = 0(klog(k)/e 2 ) 
to achieve the same approximation. Additionally, our bound is deterministic, whereas the bound in (Bout- 
sidis and Drineas, 2009) fails with constant probability. (Boutsidis and Drineas, 2009) requires an SVD 
computation in the first step, so its running time is comparable to ours. 

In order to prove the above theorem, we need a linear algebraic sparsification result from (Batson et al., 
2009), which we restate using our notation. 

Lemma 2 (Single-set Spectral Sparsification (Batson et al., 2009)). Given U G M. nxi satisfying U T U = Lj 
and r > I, we can deterministically construct sampling and rescaling matrices S and D such that, for all 
y G R £ : 



The algorithm runs in 0(rn£ 2 ) time and we denote it as [D,S] = SimpleSampling(XJ , r) 
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Proof, (of Theorem 1) Let Y = [A,b] G M nx ( d+1 ) and compute its SVD: Y = USV T . Let £ be the 
rank of Y (£ < k + 1, since rank(A) = k) and note that U G R nx£ , £ G M^, and V G K^ 1 )**. Let 
[D, S] = SimpleSampling(XJ , r) and define yi,y2 G as follows: 



yi = sv T 



-1 



and y 2 



-1 



Note that Uyi = Ax opt -b, Uy 2 = Ax opt -b, DSUyi = DS (Ax opt - b), and DSUy 2 = DS (Ax op( - b). 
We will bound ||Uy2|| in terms of ||Uyi||: 

x 2 (a) (b) (c) / x 2 

l-y/tpi ||Uy 2 ||i< ||DSUy 2 |||<||DSUyi||l< (l + y/lfi) ||XJ yi |||. 



(a) and (c) follow from Lemma 2; (b) follows from the optimality of x op t for the coreset regression in 
eqn. (1). Using I < k + 1 and manipulating the above expression concludes the proof of the theorem. The 
running time of the algorithm is equal to the time needed to compute U and the time needed to run the 
algorithm of Lemma 2 with I < k + 1. ■ 

3 Constrained Multiple-Response Regression 

Constrained multiple-response regression in the Frobenius norm can be reduced to simple regression. So, 
we can apply the results of the previous section to this setting. 

3.1 Multi-Objective Regression 

The task is to minimize, over all x G T>, the Frobenius-norm error ||A[x, . . . ,x] — B|| F . Let h avg = — Bl^ 
(here l u is a vector of all ones and thus h avg is the average of the columns in B). Recall that A G IR nxrf , 
B G R nXuJ , and let X = [x, . . . , x] G M dxuJ . 



Lemma 3. For X = [x, . . . , x] G R dXuJ , ||AX - B||| = w||Ax - b avg \\l + ^ \\b avg - B (i) 



2 
2' 

i=l 



In the above denotes the i-th column of B as a column vector. Note that the second term in Lemma 
3 does not depend on x and thus the generalized multi-objective regression can be reduced to simple 
regression on A and h avg . Using Theorem 1, we can get a coreset: let x opt minimize ||DS (Ax — h avg ) || 2 , 
where S and D are obtained via Theorem 1 applied to A and b avg . If X opt = [x opi , . . . , x opt ], then, 
by Lemma 3, X op j minimizes ||DS (AX — B) || F . Similarly, if x op t minimizes ||Ax — b ai , 5 || 2 and X op j = 
[x op t, . . . ,x op t], then X op j minimizes ||AX — B||^. Theorem 4 says that X op j is approximates X op j (the 
proof is in the appendix). 

Theorem 4. Given A G W nxd of rank k and B G W lXuJ , we can construct matrices S G W xn and D G M rxr 
(for any r > k + 1 ) such that the matrix X op t = [x op j, . . . , i op t] that minimizes ||DS (AX — B) || F over all 
matrices X = [x, x, . . . , x] satisfies: 

\\AXopt - B||p < (l + O (v 7 ^)) l|AX opt - B|| F . 

The run time of the proposed algorithm is T (U[ A ,b „ ff ]) + O { nuj + rnk 2 ), where T (U[A,b avg ]) ^ s ^ e ^ me 
needed to compute the left singular vectors of the matrix [A, h avg ] G ]R nx ( rf+1 ). 

We note that the coreset size depends only on the rank of A and not on the size of B. 
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3.2 Arbitrarily-Constrained Multiple-Response Regression 



Multi-objective regression is a special case of constrained multiple-response regression for which we can 
efficiently obtain the coresets. In the general case, the problem still reduces to simple regression, but the 
coresets are now larger. We wish to minimize ||AX — B|| F over XsDC R q!><w . Since M. dXuj is isomorphic 
to M. dul , we can view X £ R dXLJ as a "stretched out" vector X £ M. dul ; corresponding to the domain T> is the 
domain T> C M. duJ . Similarly, we can stretch out B £ R nxw to B £ ]R nw . To complete the transformation 
to simple linear regression, we build a transformed block-diagonal data matrix A from A, by repeating ui 
copies of A along the diagonal: 



£ 



r> nu) X dw 



X 



X( 2 ) 



odu) 



B 



B( 2 ) 



Lemma 5. For all A, X and B of appropriate dimensions, || AX — B 



IAX-B 



Theorem 1 gives us coresets for this equivalent regression. Note that rank(A) < uj ■ rank(A). The coreset 
will identify the important rows of A (the same row may get identified multiple times as different rows of 
A), and the important elements of B, because the entries in B are elements of B, not rows of B. Let X op i 
be the solution constructed from the coreset, which minimizes ||AX — B|| over X £ T>, and let X op t £ D 
be the corresponding solution in the original domain T>. If r is the size of the coreset and rank(A) = k, 
then, by Theorem 1, 

||AX opt -B||p < (l + O 



ku/r 



I AX 



opt 



B 



So, for the approximation ratio to be 1 + 0(e), we set r = O (fcw/e 2 ). The running time would involve the 
time needed to compute the SVD of [A, B]. 

Notice that the coresets are large and somewhat costly to compute and they only work for the Frobe- 
nius norm. In the next section, using more sophisticated techniques, we will get smaller coresets for 
unconstrained regression in both the Frobenius and spectral norms. 



4 Unconstrained Multiple-Response Regression 

Consider the following problem: given a matrix A £ R nxa! with rank exactly k and a matrix B £ M™^, 
we seek to identify the matrix X op ( £ M. dXul that minimizes (£ = 2 and £ = F) 

X op ( = arg min ||AX — Bll 2 . 

xeR dx " ? 

We can compute X op £ via the pseudoinverse of A, namely X op ^ = A + B. If S and D are sampling and 
rescaling matrices respectively, then the coreset regression problem is: 

X op t = arg min IIDSAX - DSBll 2 (2) 

xeR dx " ? 

The solution of the coreset regression problem is X opt = (DSA) + DSB. The main results in this section 
are presented in Theorems 6 and 7. 
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Theorem 6 (Spectral norm). Given a matrix A £ M. nxd with rank exactly k and a matrix B £ R nXu; , 
Algorithm 2 deterministically constructs matrices S and D such that solving the problem of eqn. (2) satisfies 
(for any r such that k + 1 < r < n): 



i 1 + ^Juj/r\ 

|AX opt — B || 2 < || AX opi — B || 2 + -== || AXop ( — B || 2 - 



1 — yk/r 

The running time of the proposed algorithm is T (Ua) + {rn {k 2 + u 2 )) , where T (Ua) is the time needed 
to compute the left singular vectors of A. 

Asymptotically, for large cj, the approximation ratio of the above theorem is O (uj/r). We will argue 
that this is nearly optimal by providing a matching lower bound in Theorem 13. 

Theorem 7 (Frobenius norm). Given matrices A £ M. nxd of rank k and B £ ]R nxw , Algorithm 3 de- 
terministically constructs a sampling matrix S and a rescaling matrix D such that solving the problem of 
eqn. (2) satisfies (for any r such that k + 1 < r < n): 



|AXw — B||p < ||AX,w — B||p H — -^t||AX,w — B" 2 



1 

opt — -D||F ^ W-n-A-Opt ~ ^>\\F "I" ~ r, ||^.^V pf — X3||p. 

k/r 

The running time of the proposed algorithm is T(Ua) + O (rnfc 2 ), where T (Ua) is the time needed to 
compute the left singular vectors of A. 

The approximation ratio in the above theorem is 2 + O(^kfr). In Theorem 14, we will give a lower 
bound for the approximation ratio which is 1 + Vt(k/r). We conjecture that our lower bound can be 
achieved, perhaps by a more sophisticated algorithm. 

Finally, we note that the ^-agnostic randomized construction of Drineas et al. (2008) achieves a (1 + e) 
approximation ratio using a significantly larger coreset, r = 0{k log(/c)/e 2 ). Importantly, they do not need 
any access to B in order to construct the coreset, whereas our approach constructs coresets by carefully 
choosing important data points with respect to the particular target response matrix B. We will also 
discuss B-agnostic algorithms in Section 4.2 (Theorem 10) and we will present matching lower bounds in 
Section 5. 

4.1 Proofs of Theorems 6 and 7 

We will make heavy use of facts from Section A. We start with a few simple lemmas. 
Lemma 8. Let E = AX^j — B be the regression residual. Then, rank(E) < mm{u,n — k}. 

Proof. Using our notation, AX opi — B = (l„ — UaU^) B = U A (U^) T B. To conclude notice that 
rank(XY) < min{rank(X), rank(Y)}. ■ 

We now give our main tool for obtaining approximation guarantees for coreset regression. The proof is 
deferred to the appendix. 

Lemma 9. Assume that the rank of the matrix DSUa £ W xk is equal to k (i.e., the matrix has full 
rank). Then, for £ = 2, F, 

||AX ojrf - B\\l < \\AX opt - B||| + ||(DSU A ) + DS (AX opi - B) |||. 
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This lemma provides a framework for coreset construction: all we need are sampling and rescaling matrices 
S and D, such that rank(DSU A ) = k and || (DSUa) + DS (AX^ — B) ||| is small. The final ingredients 
for the proofs of Theorems 6 and 7 are two matrix sparsification results, Lemmas 16 and 17 in the 
Appendix. 

Proof, (of Theorem 6) Theorem 6 follows from Lemmas 9 and 16. First, compute the SVD of A to obtain 
U A , and let E = AX opt - B = U A UjB - B. Next, run the algorithm of Lemma 16 to obtain [fi, S] = 
MultipleSpectralSampling (Ua,E,t). This algorithm will run in time T$vd (E) + O (rn (k 2 + p|j)) , 
where k is the rank of U A and A. The total running time of the algorithm is T(U A ) + T$vd (E) + 
O (rn (k 2 + pi)) = T (U A ) + O (rn (k 2 + uj 2 )). 

Lemma 16 guarantees that D and S satisfy the rank assumption of Lemma 9. To conclude the proof, 
we bound the second term of Lemma 9, using the bounds of Lemma 16 and /?e < minjcj, n — k} < uj: 

||(DSU A )+DS(AX opt -B)||2 < ||(DSU A ) + |||||DS(AX opt -B)||2 

2 / , \ -2 



< 1 + 7^ 1-^^77 ||AX op t-B|||. 



Proof, (of Theorem 7) Similar to Theorem 6, using Lemma 17 instead of Lemma 16. ■ 
4.2 B-Agnostic Coreset Construction 

All the coreset construction algorithms that we presented so far carefully construct the coreset using 
knowledge of the response vector. If the algorithm does not need knowledge of B to construct the coreset, 
and yet can provide an approximation guarantee for every B, then the algorithm is B-agnostic. A B- 
agnostic coreset construction algorithm is appealing because the coreset, as specified by the sampling and 
rescaling matrices S and D, can be computed off-line and applied to any B. We briefly digress to show 
how our methods can be extended to develop B-agnostic coreset constructions. 

Theorem 10 (B-Agnostic Coresets). Given a matrix A G W nxd with rank exactly k and a matrix B G 
JR nxw , there exists an algorithm to deterministically construct a sampling matrix S and a rescaling matrix 
D such that for any B G K nXLJ , the matrix X op t that solves the problem of eqn. (2) satisfies (for any r 
such that k < r < n): 

~ 2 2 / 1 + yjn/r\ 2 

1 1 AX^t — B|L < ||AX op t — B|L + -== ||AX op t — B|b. 

y 1 — \fk/r J 

The running time of the proposed algorithm is T(U A ) + O (rnk 2 ^, where T (TJ A ) is the time needed to 
compute the left singular vectors of A . 

Proof. The proof is similar to the proof of Theorem 6, except we now construct the sampling and rescaling 
matrices as [D,S] = MultipleSpectralSampling (U A ,I n , r) . To bound the second term in Lemma 9, we 
use 

||(DSU A ) + DS(AX op t-B)||! = ||(DSU A ) + DSI n (AX op t-B)||| 

< || (DSU A ) + |||||DSI n ||||| (AX op t - B) |||, 
and the bounds of Lemma 16. ■ 

The above bound decreases with r and holds for any B, guaranteeing a constant-factor approximation 
with a constant fraction of the data. The approximation ratio is 0(n/r), which seems quite weak. In the 
next section, we show that this result is tight. 
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5 Lower Bounds on Coreset Size 



We have just seen a B-agnostic coreset construction algorithm with a rather weak worst case guarantee 
of 0{n/r) approximation error. We will now show that no deterministic B-agnostic coreset construction 
algorithm can guarantee a better error (Theorem 11). 

(Drineas et al., 2008) provides another B-agnostic coreset construction algorithm with r = 0{k log(/c)/e 2 ). 
For a fixed B, the method in (Drineas et al., 2008) delivers a probabilistic bound on the approximation 
error. However, there are target matrices B for which the bound fails by an arbitrarily large amount. 
The probabilistic algorithms get away with this by brushing all these (possibly large) errors into a low 
probability event, with respect to random choices made in the algorithm. So, in some sense, these algo- 
rithms are not B-agnostic, in that they do not construct a coreset which works well for all B with some 
(say) constant probability. Nevertheless, the fact that they give a constant probability of success for a 
fixed but unknown B makes these algorithms interesting and useful. We will give a lower bound on the 
approximation ratio of such algorithms as well, for a given probability of success (Theorem 12). Finally, we 
will give lower bounds on the size of the coreset for the general (non-agnostic) multiple regression setting 
(Theorems 13 and 14). 

5.1 An Impossibility Result for B-Agnostic Coreset Construction 

We first present the lower bound for simple regression. Recall that a coreset construction algorithm is 
b-agnostic if it constructs a coreset without knowledge of b, and then provides an approximation guarantee 
for every b. We show that no coreset can work for every b; therefore a b-agnostic coreset will be bad for 
some vector b. In fact, there exists a matrix A such that every coreset has an associated "bad" b. 

Theorem 11 (Deterministic b-Agnostic coresets). There exists a matrix A £ M. nxd such that for every 
coreset C G W xd of size r < n, there exists h 6 R™ (depending on C) for which 

\\Ax opt - b||| > ^||Ax opt - b|||. 

Proof. Let A be any matrix with orthonormal columns whose first column is l n /y/n, and consider any 
coreset C of size r. Let b = Itj/V n — r, where 1q is the n- vector of l's except at the coreset locations. 
So for the coreset regression, b c = 0, and so x op t = O^xi- Therefore, ||Ax opt — b||| = = 1. Let 

Pa project onto the columns of A and P A (i) project onto the first column of A. The following sequence 
establishes the result: 

||Ax opt - b||l = ||(I - P A )b||l < ||(I - P A( i))b||| = - 



We now consider randomized algorithms that construct a coreset without looking at b (e.g. (Drineas et al., 
2008)). These algorithms work for any fixed (but unknown) b, and deliver a probabilistic approximation 
guarantee for any single fixed b; in some sense they are b-agnostic. By the previous discussion, the 
returned coreset must fail for some b, i.e., the probabilistic guarantee does not hold for all b) and, when 
it fails, it could do so with very bad error. We will now present a lower bound on the approximation 
accuracy of such existing randomized algorithms for coreset construction, even for a single b. 

First, we define randomized coreset construction algorithms. Let Ci, C2, . . . , C/«\ be the (™) different 
coresets of size r. A randomized algorithm assigns probabilities pi,p%, . . . ,p/ n \ to each coreset, and selects 
one according to these probabilities. The probabilities pi may depend on A. The algorithm is b-agnostic 
if the probabilities pi do not depend on b. 
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Theorem 12 (Probabilistic b- Agnostic Coresets). For any randomized h-agnostic coreset construction 
algorithm, and any integer < I < n — r, there exists A G ]R nxd and b G W 1 , such that, with probability 
at least ( n /)/(™)> 

ii iii2 ^ ii* iii9 

||Ax opt - b|| 2 > ^— ^||Ax op i - b|| 2 . 

Proof. Let A be any matrix with orthonormal columns whose first column is l n /y/n, as in the proof of 
Theorem 11. Let T be a set of size I < n — r. The neighborhood N(T) is the set of coresets that have 
non-empty intersection with T. Every coreset appears in (") — ( n ^ r ) such neighborhoods (the number 
of sets of size i which intersect with a coreset of size r). Let Pr [T] be the probability that the coreset 
selected by the algorithm is in N (T); then, Pr [T] = J2ci£N(T) ^ r l^i]- Therefore, 

T T Ci£N(T) 

where the last equality follows because each coreset appears exactly ( ™ ) — ( n J r ) times in the summation 
and X^i-P 1- Pi] = 1- Thus, there is at least one set T* for which 

l n\ / n—r\ /n—r\ 

Pr [C G N(X)\ < ' ] = 1 " W- 

So with probability at least ( n 7 r ) / \ i ) ■> * ne selected coreset does not intersect with T*. Select b = It* (the 
unit vector which is 1/y/I at the indices corresponding to T*). Now, with probability at least ( n J r ) / ("), 
x op t = 0, and the analysis in the proof of Theorem 11 shows that || Ax op t — 6||| > r^jH Ax^ — 1 1 2 - B 

By Stirling's formula, after some algebra, the probability ( n J r ) / ( «) is asymptotic to e~ 2ri / n . Setting 
£ = Q(n/r) gives a success probability that is (a constant), then the approximation ratio cannot 

be better than 1 + f2(l/r). With regard to high probability (approaching one) algorithms, consider I = 
nlogn/2r to conclude that if the success probability is at least 1 — 1/n, the approximation ratio is no 
better than 1 + log(n)/(2r — logra). 




5.2 Lower Bounds for Non-Agnostic Multiple Regression 

For both the spectral and the Frobenius norm, we now consider non-agnostic unconstrained multiple 
regression, and give lower bounds for coresets of size r > d = rank(A) (for simplicity, we set rank(A) = d). 
The results are presented in Theorems 13 and 14. 

Theorem 13 (Spectral Norm). There exists A G M nxa! and B G M™ xw such that for any r > d and 
any sampling and reseating matrices S G W xn and D G ]R rxr , the solution to the coreset regression 
Xopt = (DSA)+DSB G R dXuJ satisfies 

||AX opt — B|L > — — ||AX op t — B || 2 - 
z r + 1 

Proof. First, we need some results from (Boutsidis et al., 2011). Boutsidis et al. (2011) exhibits a matrix 
B G Rv £J_1 ) xa ' such that for any sampling matrix S G ]R r ' x ( a;_1 ) and rescaling matrix D G M rxr , with 
C = DSB (rescaled sampled coreset of B), 

||B-n Cl fc(B)||l > -^-HB-BfclH, 
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where Hc,fc(B) is the best rank-/c approximation to B whose rows lie in the span of all the rows in C 
(the row-space of C); and, B^ is the best rank-/c approximation to B (which could be computed via the 
truncated SVD of B). Actually, D is irrelevant here because the row-space of SB is not changed by a 
positive diagonal rescaling matrix D. 

Since IIc,fc(B) is the best rank-/c approximation to B in the row-space of C, it follows that ||B — 
IIc,fc(B)||2 < ||B — XC||| for any Xg R( a;_1 ) xr with rank at most k (because XC will have rank at most 
k and is in the row space of C). Set X = U B ,fc(DSU B ,fc) + , where U B , fc G R^" 1 )** has k columns which 
are the top-A: left singular vectors of B. It is easy to verify that X has the correct dimensions and rank at 
most k. Since C = DSB, we have that 

||B - n ,fc(B)||l < ||B - U B , fe (DSU B , fc ) + DSB||2. 

We now construct the regression problem. Let A = U B c ; G M( w ~ 1 ) xd (i.e., we choose k = d in the above 
discussion and n = oj — 1. Suppose a coreset construction algorithm gives sampling and rescaling matrices 
S and D, for a coreset of size r. So, the coreset regression is with A = C = DSA and B = DSB. The 
solution to the coreset regression is 

X opt = A+B = C+DSB = (DSA)+DSB = (DSU Bd )+DSB, 

which means that 

||AX opi - B\\l = ||U Bjd (DSU Bid ) + DSB - B||| > ||n c , d (B) - B||| > ^||B d - B\g 

To conclude the proof, observe that B^ = U BjC ;Ug rf B = AA + B = AX op t- ■ 

Theorem 14 (Frobenius Norm). There exists A G M Tlx ' i and B G M TlXa; such that for any r > d and 
any sampling and rescaling matrices S G M. nxr and D G M rxr , the solution to the coreset regression 
X opt = (DSA)+DSB G R dxuj satisfies 

\\AXopt - B||| > (l + ||AX opt - B|||. 

Proof. The proof of this Frobenius norm lower bound follows the same argument as in the proof of Theo- 
rem 13, with cj/(r+l) replaced by 1 + d/r, providing that there is a matrix B for which ||B — lie d(B)||p ^ 
(1 + d/r)||B — Brf|| F . Indeed, the construction of such a matrix was presented in (Boutsidis et al., 2011). ■ 



6 Open problems 

Can one determine the minimum size of a coreset that provides a (1 + e) relative-error guarantee for simple 
linear regression? We conjecture that £1 (k/e) is a lower bound, which will make our results almost tight. 
Certainly, rich coresets of size exactly k cannot be guaranteed: consider two data points (1, 1), (— 1, 1). 
The optimal regression is 0; however any coreset of size one will give non-zero regression. Is it possible to 
get strong guarantees on small corsets for other learning problems? 
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A Linear Algebra Background 



The Singular Value Decomposition (SVD) of a matrix A E R nx<i of rank k is a decomposition A = 
U a £aVL The singular values a\ > 02 > • • • > o"fe > are contained in the diagonal matrix £a E R ; 
Ua G M nxfc contains the left singular vectors of A; and Va G M. dxk contains the right singular vectors. 
The SVD of A can be computed in deterministic 0(ndmin{n,d}) time. 

The Moore- Penrose pseudo-inverse of A is equal to A + = VaST UT . Given an orthonormal matrix 
U A G R nxk , the perpendicular matrix U A G R nx ( n ~ k ) to U A satisfies: (U A ) T U A = I n _ fc) U^U A = 
Ofcx(n-fc); an d UaU^ + U A (U A ) T = I n . All the singular values of both Ua and U A are equal to one. 

Given Ua, U a can be computed in deterministic O (n(n — k) 2 ) time via the QR factorization. 



We remind the reader of the Frobenius and spectral matrix norms: ||A||| = £^ • Afj = Yli=i °~i an ^ 
II II 2 = We wm sometimes use the notation ||A||^ to indicate that an expression holds for both 
£ = 2 or £ = F. For any two matrices X and Y, ||X||2 < ||X||f < \/rank(X)||X||2; ||XY||p < 
||X||f||Y||2; ||XY||f < ||X||2||Y||f. These are stronger variants of the standard submultiplicativity 
property ||XY||^ < ||X||^||Y||^ and we will refer to them as spectral submultiplicativity. It follows that, if 
Q is orthonormal, then ||QX||g < ||X||^ and ||YQ T ||g < ||Y||^. Finally, 

Lemma 15 (matrix-Pythagoras). Let X and Y be two n x d matrices. If XY T = nxn or X T Y = Odxd, 
then ||X + Y\\j < ||X||| + ||Y|||. 

A.l Sparsification Results 

We now state two recent results on matrix sparsification ((Boutsidis, 2011, Lemmas 71 and 72, p. 
132),(Boutsidis et al., 2011)) using our notation. 

Lemma 16 (Spectral Sparsification). Let Y E M. nxei and \P G R nx ^ 2 with respective ranks py, and py. 
Given r > py, there exists a deterministic algorithm that runs in time T$vd (Y) + Tsvd (^) + 0{rn{p\ + 
p%)) and constructs sampling and rescaling matrices S G M. rxn , D E R rxr satisfying: 




rank(DSY) = rank(Y) ; || (DSY) + || 2 < || v+ lb; ||DS*|| 2 < 1 + 

l-y/pv/r V 

If = \ n , the running time of the algorithm reduces to T$vd (Y) + 0\rnp\). We write [D,S] = 
MultipleSpectralSampling (Y , ^ , r) to denote such a deterministic procedure. 

Lemma 17 (Spectral-Frobenius Sparsification). Let Y E R nx ^ and E R nx ^ 2 with respective ranks py, 
and pq,. Given r > pY, there exists a deterministic algorithm that runs in time Tsvd(Y) + 0{rnp\ + l2n) 
and constructs sampling and rescaling matrices S E R rxn , D E R rxr satisfying: 

rank (DSY) = rank (Y) ; || (DSY) + || 2 < || v+ lb; ||DStf|| F < ||*||f- 

1 - y/pv/r 

If ^ = l n , the running time of the algorithm reduces to TgvD (Y) + 0(rnp Y ). We write [D, S] = 
MultipleFrobenius Sampling (Y , *I> , r) to denote such a deterministic procedure. 
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Algorithms 



Input: A G R nxd of rank k, b G M n , and r > jfe + 1. 
Output: sampling matrix S and rescaling matrix D. 

1: Compute the SVD of Y = [A, b]. Let Y = USV T , where U G R nx£ , S G R £x£ and 

V G R dx£ , with £ < k + 1 (the rank of Y). 
2: Return [0, S] = SimpleS ampling (U , r) (see Lemma 2) 



Algorithm 1: Deterministic coreset construction for constrained linear regression. 



Input: A G R nxd of rank fc, B G M nxw , and r > k. 
Output: sampling matrix S and rescaling matrix D. 

1: Compute the SVD of A: A = Ua^aVJ, where U A G R nxk , S A G M fcxfc , and V A G R d 

compute E = U A U A B - B. 
2: return [S,D] = MultipleSpectralSampling(TJj^,'E,r) (see Lemma 16) 



Algorithm 2: Deterministic coresets for multiple regression in spectral norm. 



Input: A G R nxd of rank k, B G M nXaJ , and r > k. 
Output: sampling matrix S and rescaling matrix D. 

1: Compute the SVD of A: A = U A S A VX, where U A G M nxfc , S A G R kxk , and V A G R d 

compute E = U A UXB - B. 
2: return [S,D] = MultipleFrobeniusSampling(U\,'E,r) (see Lemma 17) 



Algorithm 3: Deterministic coresets for multiple regression in Frobenius norm. 
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C Technical Proofs 

Proof. (Theorem 4) We first construct D and S via Theorem 1 applied to A and h avg . The running time 
is O (nui) (the time needed to compute h avg ) plus the running time of Theorem 1. The result is immediate 
from the following derivation: 

||AX opt — B|| F = iL>||Ax op t - b at , g || 2 + \\h avg — B^|| 

i=l 

< (l + o(v^)) w||Ax opt -b a ^|| 2 + ^||b a ^-B«|| 2 

i=i 



2 / 

< (l + 0(^V)) [wHAXopt-ba^f + ^llba^-BW'' 5 

8=1 



( = } (l + o(7fcA : )) 2 ||AX op4 -B|| 2 ,. 

(a) follows by Lemma 3; (b) follows because x op t is the output of a coreset regression as in Theorem 1. 
Finally, r > k + 1 implies that fl + O (v^A")) =1 + {\/k/r) ■ ■ 

Proof. (Lemma 9) To simplify notation, let W = DS. Using the SVD of A, A = UaSaV^, we get: 
||B - AX opt || 2 = ||B - U A S A Vl(WU A S A Vl)+WB|| 2 = ||B - U A (WU A )+WB|| 2 , 

where the last equality follows from properties of the pseudo-inverse and the fact that WU A is a full-rank 
matrix. Using B = (u A U£ + (U A ) T ) B, we get 



2 



2 



HB-AXoptHj = ||B-U A (WU A ) + W (^U A U A + U A (U A J 1 B|| 

= ||B-U A (WU A ) + WU A UlB + U A (WU A ) + WUi(uij B|| 

( = } ||U A (U A ) T B + U A (WU A )+WU A (U A ) T B|| 2 
= l|Ui(Ui) T B|| 2 + ||U A (WU A )+WU A (U A ) T B|| 5 . 

(a) follows from the assumption that the rank of WU A is equal to k and thus (WU A ) + WU A = and 

(b) follows by matrix-Pythagoras (Lemma 15). To conclude, we use spectral submultiplicativity on the 
second term and the fact that U A (U A ) B = AX opf — B. ■ 
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